gh-148762: Speed up multiline regexes anchored by `^` by haampie · Pull Request #152339 · python/cpython

haampie · 2026-06-26T20:50:00Z

Multiline regexes of the form `re.compile("^foo", re.MULTILINE)` currently
fall into the generic search loop, which calls `SRE(match)` at every
position in the subject string. Since a `^`-anchored (`SRE_AT_BEGINNING_LINE`)
pattern can only match at the start of the string or right after a line break,
we can instead jump from one line start to the next, skipping all the
intermediate positions.
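
As a rough illustration of the idea, here is a minimal Python model of the jump between line starts. It is a sketch, not the C change in this PR: `search_line_starts` is a hypothetical helper name, and it assumes the pattern was compiled with `re.MULTILINE` and starts with `^`.

```python
import re


def search_line_starts(pattern, string, pos=0):
    """Return the first match at or after pos, trying only line starts.

    Hypothetical helper for illustration; assumes `pattern` was compiled
    with re.MULTILINE and begins with "^".
    """
    while True:
        # One (potentially expensive) match attempt at a candidate position.
        m = pattern.match(string, pos)
        if m is not None:
            return m
        # Jump to just past the next newline instead of advancing by one.
        newline = string.find("\n", pos)
        if newline == -1:
            return None
        pos = newline + 1


p = re.compile("^foo", re.MULTILINE)
s = "bar\nfoo\nxfoo\nfoo"
# The model agrees with the built-in search on these inputs.
assert search_line_starts(p, s).span() == p.search(s).span() == (4, 7)
assert search_line_starts(p, s, 5).span() == p.search(s, 5).span() == (13, 16)
```

The number of match attempts in this model is bounded by the number of lines rather than the number of characters, which is the saving the description refers to.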

Benchmarks show good improvements in runtime across UCS-1/2/4; full
numbers are in the issue.

Issue: Speed up multiline regexes anchored by ^. #148762

Signed-off-by: Harmen Stoppels <harmenstoppels@gmail.com>

eendebakpt

Claude suggests adding some tests for coverage. Not sure we need all of them, but including in details here for reference.

<details>
<summary>Details</summary>

```
def test_search_anchor_at_beginning_line(self):
    # gh-148762: a multiline "^" search jumps between line starts. These
    # cases pin the behaviour the optimization must preserve.
    for pattern, cases in [
        ('^', [
            ('', [(0, 0)]),
            ('abc', [(0, 0)]),
            ('\n', [(0, 0), (1, 1)]),
            ('\n\n', [(0, 0), (1, 1), (2, 2)]),
            ('a\n', [(0, 0), (2, 2)]),  # match at end after \n
            ('\na', [(0, 0), (1, 1)]),
            ('a\nb\nc', [(0, 0), (2, 2), (4, 4)]),
            ('a\n\nb', [(0, 0), (2, 2), (3, 3)]),  # empty line
            ('\n\n\n', [(0, 0), (1, 1), (2, 2), (3, 3)]),
        ]),
        ('^a', [
            ('a', [(0, 1)]),
            ('a\na', [(0, 1), (2, 3)]),
            ('a\nba\na', [(0, 1), (5, 6)]),
            ('ba\nab', [(3, 4)]),
            ('a\n', [(0, 1)]),  # no match-at-end: needs 'a'
            ('\na', [(1, 2)]),
            ('aa\naa', [(0, 1), (3, 4)]),
            ('a\n\na', [(0, 1), (3, 4)]),
            ('a\nĀa\na', [(0, 1), (5, 6)]),  # UCS2 string kind
            ('Ā\na\nĀ', [(2, 3)]),
            ('a\n\U0001F600a\na', [(0, 1), (5, 6)]),  # UCS4 string kind
            ('\U0001F600\na', [(2, 3)]),
        ]),
    ]:
        p = re.compile(pattern, re.MULTILINE)
        for s, expected in cases:
            with self.subTest(pattern=pattern, string=s):
                self.assertEqual([m.span() for m in p.finditer(s)], expected)

    # bytes (8-bit) path
    pb = re.compile(b'^a', re.MULTILINE)
    for s, expected in [(b'a\nba\na', [(0, 1), (5, 6)]), (b'a\n', [(0, 1)]),
                        (b'\na', [(1, 2)]), (b'abc', [(0, 1)])]:
        with self.subTest(string=s):
            self.assertEqual([m.span() for m in pb.finditer(s)], expected)

    # pos / endpos: the search may begin mid-line or on a line start
    pa = re.compile('^a', re.MULTILINE)
    self.assertEqual([m.span() for m in pa.finditer('xa\na', 1)], [(3, 4)])
    self.assertEqual([m.span() for m in pa.finditer('a\na', 2)], [(2, 3)])
    self.assertEqual([m.span() for m in pa.finditer('a\na\na', 1, 3)], [(2, 3)])
    self.assertEqual([m.span() for m in pa.finditer('a\na', 0, 1)], [(0, 1)])

    # sub / subn / split also drive search()
    pc = re.compile('^', re.MULTILINE)
    self.assertEqual(pc.sub('#', 'a\nb\nc'), '#a\n#b\n#c')
    self.assertEqual(pc.sub('#', 'a\nb\n'), '#a\n#b\n#')
    self.assertEqual(pc.subn('#', 'a\nb\n'), ('#a\n#b\n#', 3))
    self.assertEqual(pc.split('a\nb'), ['', 'a\n', 'b'])
    self.assertEqual(pc.split('a\nb\n'), ['', 'a\n', 'b\n', ''])
```

</details>

eendebakpt · 2026-06-29T20:07:52Z

+            while (ptr < end && !SRE_IS_LINEBREAK(*ptr))
+                ptr++;
+            if (ptr >= end)
+                return 0;


This could be

Suggested change

-            while (ptr < end && !SRE_IS_LINEBREAK(*ptr))
-                ptr++;
-            if (ptr >= end)
-                return 0;
+#if SIZEOF_SRE_CHAR == 1
+            ptr = memchr(ptr, '\n', end - ptr);
+            if (ptr == NULL)
+                return 0;
+#else
+            while (ptr < end && !SRE_IS_LINEBREAK(*ptr))
+                ptr++;
+            if (ptr >= end)
+                return 0;
+#endif

(I did not benchmark, not sure it is worth the change)
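
For readers less familiar with the two branches of the suggestion, a rough Python analogue follows. It only shows that a hand-written scan and a library scan return the same position. `bytes.find` stands in for `memchr` here as an assumption of the sketch, both function names are hypothetical, and nothing in it says anything about the relative speed of the C versions.

```python
def next_newline_loop(data: bytes, start: int, end: int) -> int:
    """Hand-written scan: index of the first newline byte in data[start:end], or -1."""
    i = start
    while i < end and data[i] != 0x0A:
        i += 1
    return i if i < end else -1


def next_newline_find(data: bytes, start: int, end: int) -> int:
    """Library scan with the same contract, delegated to bytes.find."""
    return data.find(b"\n", start, end)


haystack = b"spam\neggs\n"
# Both strategies agree for every possible starting offset.
for start in range(len(haystack) + 1):
    assert next_newline_loop(haystack, start, len(haystack)) == \
        next_newline_find(haystack, start, len(haystack))
```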

haampie · Jul 3, 2026

I had another issue/PR where I tried something like this, but it's hard to optimize when you don't know the distribution of `\n` characters in the "haystack". See #148729 (comment); on the macbook it was hard to beat a hand-written loop:

> The regression on darwin is because the letter i has a density of 2.88% in the corpus; the cross-over density is apparently about 2%, below which memchr is faster.

Based on wc -cl $(find -type f -name '*.py') from cpython's own sources, there are 988799 lines and 36153429 bytes, or a newline character density of 2.7%, meaning on my macbook the memchr would likely be slightly worse. But it depends on the use case.

So, I would hold off on combining different types of optimizations. This PR is about reducing the number of expensive match function calls.
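
The density arithmetic in this reply can be checked with a few lines of Python. This is only a worked version of the numbers quoted above: `newline_density` and `CROSSOVER` are hypothetical names, and the 2% threshold is the approximate cross-over mentioned in the linked comment, not a measured constant.

```python
def newline_density(newlines: int, total_bytes: int) -> float:
    """Fraction of bytes that are newline characters."""
    return newlines / total_bytes


# Figures quoted above for CPython's own .py sources.
density = newline_density(988_799, 36_153_429)
# Approximate cross-over density reported for the macbook (about 2%).
CROSSOVER = 0.02

print(f"newline density: {density:.1%}")  # newline density: 2.7%
print("above cross-over" if density > CROSSOVER else "below cross-over")
```

With these inputs the density lands above the quoted cross-over, which matches the expectation that `memchr` would not help on that machine for this kind of corpus.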

haampie added 3 commits June 26, 2026 19:53

pythongh-148762: speed up caret match in regexes

ab4bf9f

Signed-off-by: Harmen Stoppels <harmenstoppels@gmail.com>

Reduce control flow nesting

134b179

news

d62cc5b

Signed-off-by: Harmen Stoppels <harmenstoppels@gmail.com>

bedevere-app Bot mentioned this pull request Jun 26, 2026

Speed up multiline regexes anchored by ^. #148762

Open

bedevere-app Bot added the awaiting review label Jun 26, 2026

drop cast cause not used consistently elsewhere

5d1149a

Signed-off-by: Harmen Stoppels <harmenstoppels@gmail.com>

eendebakpt reviewed Jun 29, 2026

haampie wants to merge 4 commits into python:main from haampie:hs/fix/multiline-caret


haampie Jul 3, 2026


2 participants
